ABSTRACT

Regret minimizing sets are a very recent approach to representing
a dataset D with a small subset S of representative tuples. The set
S is chosen such that executing any top-1 query on S rather than D
is minimally perceptible to any user. To discover an optimal regret
minimizing set of a predetermined cardinality is conjectured to be
a hard problem. In this paper, we generalize the problem to that of
finding an optimal k-regret minimizing set, wherein the difference
is computed over top-k queries, rather than top-1 queries.

We adapt known geometric ideas of top-k depth contours and the
reverse top-k problem. We show that the depth contours themselves
offer a means of comparing the optimality of regret minimizing sets
with L2 distance. We design an O(cn^2) plane sweep algorithm for
two dimensions to compute an optimal regret minimizing set of
cardinality c. For higher dimensions, we introduce a greedy algorithm
that progresses towards increasingly optimal solutions by exploiting
the transitivity of L2 distance.

Categories and Subject Descriptors 

H.3.3 [Information Storage and Retrieval]: Information Search
and Retrieval; F.2.2 [Analysis of Algorithms and Problem Com- 
plexity]: Nonnumerical Algorithms and Problems — geometrical 
problems and computations 

General Terms 

Algorithms, Theory 

Keywords 

regret, representative databases, top-k, arrangement of lines, plane 
sweep 

1. INTRODUCTION

For a user navigating a large dataset, the availability of a succinct
representative subset of the data is crucial. For example, consider
Table 1, a toy, but real, dataset consisting of the top eight scoring
NBA players from the 2009 basketball season. A user viewing
this data would typically be curious which of these eight players
were "top of the class" that season. That is, he is curious which
few tuples best represent the entire dataset, without his having to
peruse it in its entirety.

id  player name        points  rebs  steals  fouls
1   Kevin Durant         2472   623     112    171
2   LeBron James         2258   554     125    119
3   Dwyane Wade          2045   373     142    181
4   Dirk Nowitzki        2027   520      70    208
5   Kobe Bryant          1970   391     113    187
6   Carmelo Anthony      1943   454      88    225
7   Amare Stoudemire     1896   732      52    281
8   Zach Randolph        1681   950      80    226

Table 1: Statistics for the top eight NBA point scorers from the
2009 regular season, taken from databasebasketball.com. The top
score in each statistic is bolded.

A well-established approach to representing a dataset is with the
skyline operator, which returns all pareto-optimal points.^1 The
intention of the skyline operator is to reduce the dataset down to only
those tuples that are guaranteed to best suit the preferences or
interests of somebody. If the toy dataset in Table 1 consisted only
of the attributes points and rebounds, then the skyline would consist
only of the players Kevin Durant, Amare Stoudemire, and Zach
Randolph, so these three players would represent the most impressive
combinations of point-scoring and rebounding statistics. The skyline
is a powerful summary operator only on low dimensional datasets,
however; even on this toy example, everybody is in the skyline if we
consider all four attributes. In general, there is no guarantee that
the skyline is an especially succinct representation of the dataset.
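
The skyline claims above can be checked with a short sketch. The helper names (`dominates`, `skyline`) are hypothetical, and the sketch assumes that a larger value is better for every attribute, including fouls, which the text does not state explicitly.

```python
# Rows of Table 1: (points, rebs, steals, fouls).
PLAYERS = {
    "Durant": (2472, 623, 112, 171),
    "James": (2258, 554, 125, 119),
    "Wade": (2045, 373, 142, 181),
    "Nowitzki": (2027, 520, 70, 208),
    "Bryant": (1970, 391, 113, 187),
    "Anthony": (1943, 454, 88, 225),
    "Stoudemire": (1896, 732, 52, 281),
    "Randolph": (1681, 950, 80, 226),
}


def dominates(q, p):
    """True if q is at least as good as p everywhere and better somewhere."""
    return all(a >= b for a, b in zip(q, p)) and any(a > b for a, b in zip(q, p))


def skyline(data, attrs):
    """Names of the tuples not dominated on the chosen attribute indices."""
    proj = {name: tuple(row[i] for i in attrs) for name, row in data.items()}
    return sorted(
        name for name, p in proj.items()
        if not any(dominates(q, p) for other, q in proj.items() if other != name)
    )


two_d = skyline(PLAYERS, (0, 1))          # points and rebounds only
four_d = skyline(PLAYERS, (0, 1, 2, 3))   # all four attributes
```

Under these assumptions, `two_d` contains exactly Durant, Randolph, and Stoudemire, while `four_d` contains all eight players.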

Regret 

A promising new alternative is the regret minimizing set, introduced
by Nanongkai et al. [14], which hybridizes the skyline operator
with top-k queries. A top-k query takes as input a utility function
f and evaluates each tuple according to f, reporting the k tuples
with highest values. Figure 1 shows how highly three of the points
rank for a user utility function of f(pts, rebs) = (pts + rebs)/2,
if the attributes are normalized. The distance of each point from the
line orthogonal to f is proportional to the point's score for that
user's weights. This reveals that Randolph earns the highest normalized
score (0.840), compared to Kevin Durant (0.828) and then Kobe
Bryant (0.604).

^1 Pareto-optimal points are those for which no other point is higher
ranked with respect to every attribute.

To evaluate whether a subset represents the entire dataset well,
Nanongkai et al. introduce the regret ratio: the gap between the best
score in the dataset and the best score in the subset, relative to the
best score in the dataset, with respect to a given utility function.
Graphically, this is proportional to how much smaller the largest
arrow in the subset is than the largest arrow overall. For the subset
{Bryant, Durant}, the regret ratio is:

(0.840 - 0.828)/0.840 = 0.0143,

since the score for Randolph is the best in the dataset at 0.840, and
the score for Durant is the best in the subset at 0.828.
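
The arithmetic of this example can be reproduced with a minimal sketch. It assumes that "normalized" means dividing each attribute by its maximum in Table 1; the function names are illustrative and not taken from the cited work.

```python
# (points, rebs) from Table 1.
STATS = {
    "Durant": (2472, 623), "James": (2258, 554), "Wade": (2045, 373),
    "Nowitzki": (2027, 520), "Bryant": (1970, 391), "Anthony": (1943, 454),
    "Stoudemire": (1896, 732), "Randolph": (1681, 950),
}


def normalize(stats):
    """Scale every attribute so that its largest value becomes 1 (assumed)."""
    maxima = [max(column) for column in zip(*stats.values())]
    return {name: tuple(v / m for v, m in zip(row, maxima))
            for name, row in stats.items()}


def score(point, weights):
    """Linear utility: weighted sum of the attributes."""
    return sum(w * x for w, x in zip(weights, point))


def gain(data, names, weights):
    """Best score achieved by any of the named tuples."""
    return max(score(data[n], weights) for n in names)


def regret_ratio(data, subset, weights):
    """(best in dataset - best in subset) / best in dataset."""
    best = gain(data, data, weights)
    return (best - gain(data, subset, weights)) / best


DATA = normalize(STATS)
F = (0.5, 0.5)  # f(pts, rebs) = (pts + rebs) / 2
ratio = regret_ratio(DATA, ["Bryant", "Durant"], F)
```

Without rounding the intermediate scores, `ratio` comes out near 0.0144; the 0.0143 in the text follows from using the three-decimal scores.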
Figure 1: A utility function f(pts, rebs) = (pts + rebs)/2
represented as a vector f = (.5, .5), and three data points from
Table 1 shown with their scores being proportional to the
distance from the line orthogonal to f.

Motivated to derive a succinct representation of a dataset, one
with fixed cardinality, Nanongkai et al. introduce regret minimizing
sets [14], posing the question, "Does there exist one set of c tuples
that makes every user at least x% happy?" A regret minimizing set
is a subset of a dataset that minimizes the regret ratio.

A linear top-k query can be considered as a problem of projection
[6], where each tuple is regarded as a vector, as is the utility
function. The score of a tuple is proportional to the size of its
projection onto the utility function vector, and its scores for every
possible utility function trace a (hyper-)sphere emanating from the
point. Minimizing regret ratio is equivalent to finding a subset that
minimizes the maximum distance between the "best" spheres in the
subset and the "best" spheres in the entire dataset, as illustrated in
Figure 2. Of the eight basketball players of Table 1, Zach Randolph
meets this criterion, so he is the optimal regret minimizing set of
order (i.e., size) 1.

Randolph, the optimal regret minimizing tuple, however, is a peculiar
choice to represent the dataset of Table 1 since he is the worst
rated with respect to points. This exposes a weakness of regret
minimizing sets: they are based on assuming that a "happy" user is one
who obtains their absolute top choice. However, for an analyst
curious to know what is a high point-scoring player, is he really
dissatisfied with LeBron James as a query response rather than Kevin
Durant?

To change the scenario a bit, consider a dataset of hotels and a
user searching for one that suits his preferences. The absolute top
theoretical choice may not suit him especially well at all. It could
be fully booked. Or, he might know that the manager reminds him
of his ex-wife. Regardless, it makes sense to present him a few, say
k, options, with any of which he would be happy.

We generalize the concept of regret and of regret minimizing
sets to that of k-regret, analogous to the difference between top-k
queries and top-1 queries, because top-k is often a better threshold
for "happiness". The analogous problem is to find a subset S of
points in the dataset that minimizes the distance from the best point
in S to the k'th best point in the entire dataset. This relaxation
prevents having to fit an outlier tuple like Randolph.

Figure 2: The spheres of scores for Durant, Bryant, and Randolph
for rebounds as the x-attribute and points as the y-attribute.
The regret ratio of a subset S is (roughly) the ratio
of the distance from the best possible score in S from the best
possible score on the entire dataset. For Durant and Bryant,
this is maximized on the x-axis (where the best score is instead
Randolph), and for Randolph, conversely, this is maximized on
the y-axis (where the best score is instead Durant).

Optimality 

A fundamental open question remains with regard to both the problem
introduced by Nanongkai et al. and our generalisation of it.
How can one efficiently compute the optimal k-regret minimizing
set of a predetermined cardinality c, the subset that achieves the
minimal regret ratio of all size c subsets of the dataset? This is a
problem conjectured to be NP-Hard by Nanongkai et al. for k = 1:
it involves searching for the best among O(n^c) different subsets.

We introduce algorithms that strive to compute optimal k-regret
minimizing sets. Towards this end, we draw on the recent work on
top-k depth contours of Chester et al. [7] for the reverse top-k
problem of Vlachou et al. [19]. The top-k depth contours are a
dual space, geometric idea that succinctly represents exactly the k'th
ranked tuple for all utility functions. We demonstrate that these
ideas are directly applicable to discovering optimal k-regret
minimizing sets. For instance, if the cardinality restriction is lifted,
then the contours are precisely the optimal solutions (Lemma 3.1).
In the presence of cardinality restrictions, the contours still aid in
finding optimal solutions (Theorem 1).

1.1 Contributions and Outline 

In this paper we propose the first algorithms for computing
optimal k-regret minimizing sets. In particular, we:

• generalize regret and regret minimizing sets, top-1 concepts,
to those of k-regret and k-regret minimizing sets, top-k
concepts (Section 2);

• identify that apparently unrelated work on top-k depth
contours and reverse top-k queries [7, 19] offers insight into the
problem of identifying optimal k-regret minimizing sets
(Section 3);

• introduce an O(n^2 c) algorithm to compute the optimal size-c
k-regret minimizing subset S of a two-dimensional dataset
D for the family of positive linear functions L+, despite a
conjecture by Nanongkai et al. [14] that the general dimension
problem is intractable for k = 1 (Section 4);

• introduce a greedy algorithm for general dimensions that
leverages the relationship between top-k depth contours and
optimality in order to progress towards better solutions
(Section 5); and

• relate our work within the context of other literature
(Section 6).

2. PRELIMINARIES 

In the following sections, we will demonstrate how to compute
optimal k-regret minimizing sets by equating the problem to one in
dual space. Within the dual space, the optimal solution is the top-k
depth contour if it is small enough, or else the convex chain through
the arrangement of lines in dual space that minimizes a particular
distance ratio. Before embarking on these objectives, however,
we introduce some concepts formally in four subsections: first,
k-regret and k-regret minimizing sets (Section 2.1); next, the
transformation of the data into an arrangement of lines in dual space
and some of the tools therein that we use (Section 2.2);
penultimately, the top-k depth contours that exist in the arrangement of
lines and are fundamentally connected to finding optimal k-regret
minimizing sets (Section 2.3); and, finally, the problem definition
under study (Section 2.4). Throughout the paper, we consider the
family of positive linear functions L+, which, without loss of
generality, can be reduced to the family of positive unit linear functions
U+ [6]. Nonetheless, we present the definitions for a general family.

2.1 k-Regret 

k-Regret, introduced here, is a generalisation of regret, introduced
by Nanongkai et al. [14]. We recall the definitions from that
paper and introduce the generalisation in this subsection.

Given a dataset D of n d-dimensional numeric tuples, a subset
S ⊆ D, a family of utility functions F, and a utility function
f ∈ F:

DEFINITION 2.1 (GAIN [14]). The gain for a subset S ⊆ D
on f ∈ F is:

gain(S, f) = max_{p ∈ S} f(p).

That is to say, the gain of a subset S, given a utility function f, is
simply the highest score achievable in S for the function f. Recalling
the example of Table 1 and the utility function f(pts, rebs) =
(pts + rebs)/2, and assuming the data is normalized, the gain of
{Bryant, Durant, Wade} is 0.828. The generalisation of gain is
to k-gain:

DEFINITION 2.2 (k-GAIN). Consider a descending order list
of f(p) for all p ∈ S ⊆ D, given f ∈ F. Then, the k-gain of S on
f is simply the k'th value in the list.

In other words, the k-gain for a subset S ⊆ D is the k'th best
score achieved by a point in S on the utility function f. For the
subset S = {Bryant, Durant, Wade} and the same function f,
the 2-gain is the second best score in S, 0.610, the score for Wade.
For k = 1, this definition is equivalent to Definition 2.1.

Regret ratio, then, is a reflection of how well the gain of a subset 
approaches that of the entire dataset. 

DEFINITION 2.3 (REGRET AND REGRET RATIO [14]). The regret
for a subset S ⊆ D on f ∈ F is:

r_D(S, f) = gain(D, f) - gain(S, f).

The regret ratio is:

rr_D(S, f) = r_D(S, f) / gain(D, f).

Since the best score for f is 0.840, the regret for the running
example S is (0.840 - 0.828) and the regret ratio is (0.840 -
0.828)/0.840. We generalise this to k-regret by evaluating how
well the gain of a subset approaches the k-gain of the entire dataset.
Note, again, that this reduces to Definition 2.3 if k = 1.

DEFINITION 2.4 (k-REGRET AND k-REGRET RATIO). The
k-regret is:

kr_D(S, f) = max(kgain(D, f) - gain(S, f), 0).

The k-regret ratio is:

krr_D(S, f) = kr_D(S, f) / kgain(D, f).

Since Durant is the second highest scoring tuple in D for f,
the 2-regret ratio of S = {Bryant, Durant, Wade} is (0.828 -
0.828)/0.828 = 0. The subset S perfectly matches the top-2
requirement for utility function f. Finally,

DEFINITION 2.5 (MAXIMUM k-REGRET RATIO). The maximum
k-regret ratio for a subset S ⊆ D with respect to a family of
utility functions F is:

krr_D(S, F) = sup_{f ∈ F} krr_D(S, f).

The maximum k-regret ratio is the largest observable k-regret
ratio for any utility function in an entire family. For S, the 1-regret
ratio is maximized for g(pts, rebs) = rebs, at which the best
score obtainable in S is 0.655. There, the 1-regret ratio is (1.000 -
0.655)/1.000 and the 2-regret ratio is (0.771 - 0.655)/0.771.
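
Definitions 2.2 to 2.5 can be traced on the running example with the following sketch. It assumes per-attribute normalization by the maximum in Table 1, and it approximates the supremum of Definition 2.5 by sampling directions (cos θ, sin θ) over the positive quadrant, which is an illustrative shortcut rather than an exact evaluation.

```python
from math import cos, pi, sin

# (points, rebs) from Table 1, normalized by the column maxima (assumed).
STATS = {
    "Durant": (2472, 623), "James": (2258, 554), "Wade": (2045, 373),
    "Nowitzki": (2027, 520), "Bryant": (1970, 391), "Anthony": (1943, 454),
    "Stoudemire": (1896, 732), "Randolph": (1681, 950),
}
DATA = {n: (p / 2472, r / 950) for n, (p, r) in STATS.items()}


def score(point, w):
    return w[0] * point[0] + w[1] * point[1]


def kgain(data, names, w, k):
    """k'th best score among the named tuples (Definition 2.2)."""
    return sorted((score(data[n], w) for n in names), reverse=True)[k - 1]


def k_regret_ratio(data, subset, w, k):
    """Definition 2.4: max(kgain(D) - gain(S), 0) / kgain(D)."""
    target = kgain(data, data, w, k)
    return max(target - kgain(data, subset, w, 1), 0.0) / target


def max_k_regret_ratio(data, subset, k, samples=91):
    """Sampled stand-in for the supremum of Definition 2.5."""
    angles = (pi / 2 * i / (samples - 1) for i in range(samples))
    return max(k_regret_ratio(data, subset, (cos(a), sin(a)), k) for a in angles)


S = ["Bryant", "Durant", "Wade"]
F = (0.5, 0.5)   # f(pts, rebs) = (pts + rebs) / 2
G = (0.0, 1.0)   # g(pts, rebs) = rebs
```

With these definitions, `k_regret_ratio(DATA, S, F, 2)` is 0, and on `G` the 1-regret and 2-regret ratios are roughly 0.344 and 0.149, matching the fractions given in the text.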

Finally, a k-regret minimizing set of order c is simply one with
cardinality c that minimizes the maximum k-regret ratio. There
exist optimal k-regret minimizing sets of order c, which are those
that achieve minimal maximum k-regret ratio of all subsets of size
c.

DEFINITION 2.6 (OPTIMAL k-REGRET MINIMIZING SET). An
optimal k-regret minimizing set of order c on a dataset D given a
family of utility functions F is:

S_c(D, F) = argmin_{S ⊆ D, |S| ≤ c} krr_D(S, F).

As well, Definition 2.6 reduces to that of Nanongkai et al. [14] if
k = 1.
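
Definition 2.6 can be read as an exhaustive search, sketched below on a small hypothetical dataset `TOY` (not from the paper). This is the naive enumeration that the paper sets out to avoid, and the supremum over the family is again only approximated by sampled directions, so the sketch illustrates the definition rather than providing a usable algorithm.

```python
from itertools import combinations
from math import cos, pi, sin


def kgain(points, w, k):
    """k'th best linear score among the given two-dimensional points."""
    return sorted((w[0] * x + w[1] * y for x, y in points), reverse=True)[k - 1]


def max_krr(D, S, k, samples=91):
    """Sampled approximation of the maximum k-regret ratio of S on D."""
    worst = 0.0
    for i in range(samples):
        theta = pi / 2 * i / (samples - 1)
        w = (cos(theta), sin(theta))
        target = kgain(D, w, k)
        worst = max(worst, max(target - kgain(S, w, 1), 0.0) / target)
    return worst


def brute_force(D, c, k):
    """Try every subset of size at most c and keep the one with least regret."""
    candidates = (S for size in range(1, c + 1) for S in combinations(D, size))
    return min(candidates, key=lambda S: max_krr(D, S, k))


# Hypothetical toy data: two specialists and one balanced tuple.
TOY = [(1.0, 0.1), (0.8, 0.8), (0.1, 1.0)]
best = brute_force(TOY, c=1, k=1)
```

On this toy data the balanced tuple is selected with a maximum 1-regret ratio of about 0.2, and its maximum 2-regret ratio is 0 because it is never ranked worse than second.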

2.2 Arrangements of Lines 

The algorithms that we propose in this paper are geometric in na- 
ture and operate on arrangements of hyperplanes in dual space. Ar- 
rangements of hyperplanes (or lines, in two dimensions), are well 
studied in Computational Geometry and are induced by the inter- 
sections of a set of hyperplanes. 

DEFINITION 2.7 (ARRANGEMENT). An arrangement of a set
of d-dimensional hyperplanes H, denoted A_H, is a partitioning of
R^d into cells, edges, and vertices. Each cell is a connected
component of R^d \ H. Each vertex is an intersection point of some d
hyperplanes in H. An edge is a line segment between two vertices
of A_H.

We arrive at an arrangement of hyperplanes by applying the
duality transform introduced by Chester et al. [7], which fixes an
arbitrary positive real τ and converts every point p_i ∈ D to a
hyperplane h_i by considering p_i as a vector and constructing the
hyperplane h_i to be all vectors x that solve p_i · x = τ.




Figure 3: The eight basketball players from Table 1, considering
only the attributes points and rebounds, both normalized.
The tuples are transformed into translated nullspace equations,
and the resulting arrangement of lines is shown. Also depicted
in thicker, light magenta lines is the second top-k contour [7], a
succinct representation of the 2nd ranked tuples for any top-k
query.

DEFINITION 2.8 (TRANSLATED NULLSPACE TRANSFORM [6]).
Given a fixed positive real, τ, and a dataset of d-dimensional points
D, the translated nullspace transform transforms each primal space
point p_i ∈ D into a dual space (d-1)-hyperplane h_i (or line l_i in
two dimensions) composed of all vector solutions to the equation
p_i · x = τ.

For the basketball example, considering only the attributes points
and rebounds, which have first been normalized, the arrangement
of lines produced by the translated nullspace duality transform is
illustrated in Figure 3. Note that the intersection point of two lines
l_i and l_j occurs exactly in the direction of the vector f for which
f(p_i) = f(p_j).
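
The remark about intersection points can be verified numerically in two dimensions. The sketch below fixes τ = 1 (any positive value would do), uses the normalized Durant and Randolph tuples, and relies on hypothetical helper names.

```python
from math import hypot

TAU = 1.0  # the fixed positive real of Definition 2.8 (value chosen arbitrarily)

DURANT = (2472 / 2472, 623 / 950)     # normalized (points, rebs)
RANDOLPH = (1681 / 2472, 950 / 950)


def intersection(p, q, tau=TAU):
    """Point x with p.x = q.x = tau (Cramer's rule); None if lines are parallel."""
    det = p[0] * q[1] - p[1] * q[0]
    if det == 0:
        return None
    return (tau * (q[1] - p[1]) / det, tau * (p[0] - q[0]) / det)


def utility(p, f):
    return p[0] * f[0] + p[1] * f[1]


x = intersection(DURANT, RANDOLPH)
length = hypot(*x)
f = (x[0] / length, x[1] / length)   # unit utility vector pointing at x

# Distance from the origin to Durant's dual line along f is tau / f(p).
distance_durant = TAU / utility(DURANT, f)
```

Here `f` is the unit utility function on which Durant and Randolph tie, and the distance from O to either dual line along `f` equals the length of `x`.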

Two other important concepts that are central ideas in Compu- 
tational Geometry and of high relevance to this paper are lower 
envelopes of arrangements of lines and convex chains within ar- 
rangements of lines. 

Definition 2.9 (lower envelope). The lower envelope of 
an arrangement of lines is the set of edges under which no other 
edges exist. 

For the purposes of this paper, in which we consider only the 
positive quadrant of Euclidean space, the lower envelope is the set 
of edges closest to the origin, O. 

DEFINITION 2.10 (CONVEX CHAIN). A convex chain in an
arrangement of lines A_L is the lower envelope in the arrangement
A_L' of some subset L' ⊆ L of lines.

Alternatively, a convex chain can be considered to be any set of 
edges in the arrangement that form a convex polygon with O. 

2.3 Top-k Depth Contours 

We also recall here two definitions to establish what are top-k
depth contours, since they form the basis of our algorithms.

DEFINITION 2.11 (TOP-k DEPTH [7]). The top-k rank depth
of a point p within an arrangement A is the number of edges of A
between p and the origin. That is to say, the depth of p is the
number of intersections between edges of A and [O, p].^2 Similarly, the
top-k rank depth of a cell or edge of A is the top-k rank depth of
every point within that cell.

^2 We remark that this is identical to the more familiar concept of
a level if all the lines pass through the positive quadrant, as we
assume here. Nonetheless, we adopt this definition because there
is no reason why the techniques described in this paper cannot be
extended easily to handle attributes that range into negative values.

In their paper, Chester et al. show that the rank of a point in
a dataset D is precisely its top-k rank depth, and that top-k rank
depth creates a series of n contours in R^d, the i'th of which is
comprised of the transformed points that had rank exactly i in D.

DEFINITION 2.12 (TOP-k RANK DEPTH CONTOUR [7]). A top-k
rank depth contour is the set of edges in an arrangement A_L that
have top-k rank depth exactly k.
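
The correspondence between rank and depth can be illustrated for a single direction. The sketch below assumes τ = 1 and normalized data, counts the dual lines met by the closed segment [O, x] with a small tolerance for the line that x itself lies on, and uses hypothetical function names.

```python
from math import hypot

TAU = 1.0
STATS = {
    "Durant": (2472, 623), "James": (2258, 554), "Wade": (2045, 373),
    "Nowitzki": (2027, 520), "Bryant": (1970, 391), "Anthony": (1943, 454),
    "Stoudemire": (1896, 732), "Randolph": (1681, 950),
}
DATA = {n: (p / 2472, r / 950) for n, (p, r) in STATS.items()}


def dual_point(p, f, tau=TAU):
    """Point where the dual line p.x = tau meets the ray from O along f."""
    t = tau / (p[0] * f[0] + p[1] * f[1])
    return (t * f[0], t * f[1])


def rank_depth(x, data, tau=TAU, eps=1e-9):
    """Number of dual lines met by the closed segment [O, x]."""
    return sum(1 for p in data.values() if p[0] * x[0] + p[1] * x[1] >= tau - eps)


def contour_owner(data, f, k, tau=TAU):
    """Tuple whose dual line is the k'th one met when leaving O along f."""
    nearest_first = sorted(
        data, key=lambda n: tau / (data[n][0] * f[0] + data[n][1] * f[1]))
    return nearest_first[k - 1]


norm = hypot(0.5, 0.5)
F = (0.5 / norm, 0.5 / norm)   # unit-length version of f = (.5, .5)
```

Along `F`, Randolph's dual line has depth 1 and Durant's has depth 2, so the second contour is carried by Durant's line in this direction, in agreement with the ranking on f.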

2.4 Problem Definition 

Now, we can formally describe the problem under study in this 
paper: 

PROBLEM DEFINITION 1. Given any integer c and set D of n
d-dimensional points, find an optimal k-regret minimizing set of
order c, S_c(D, U+), for the family of positive unit linear functions
U+.

3. A CONTOUR VIEW OF REGRET 

In this section, we show that the concept of regret that was
introduced by Nanongkai et al. [14], and also the generalisation we
introduce in this paper, are strongly connected to the dual space
concept of top-k depth contours introduced by Chester et al. [7].
More precisely, we prove Theorem 1, which equates the problem
of finding an optimal k-regret minimizing set to a dual space
problem of finding a set of lines that are "closest" to the top-k depth
contour. This alternative formulation of the problem facilitates
designing algorithms in the dual space for two dimensions (Section 4)
and general dimension (Section 5) to find optimal regret minimizing
sets.

The argument proceeds by showing that the contour itself, C_k,
is the optimal solution, provided that it is small enough, |C_k| ≤ c
(Lemma 3.1). In the dual space, the regret ratio of a line relative
to another line, given a utility function f, is given by the relative
distances of the lines from the origin in the direction indicated by f
(Lemma 3.2). So, the evaluation of regret in dual space is a scaled
Euclidean distance computation.

We also show that the best options available to users within a set
of points S ⊆ D are given exactly by the lower envelope of the set
of dual lines of S (Lemma 3.4). So, minimizing the scaled distance
of that envelope from the contour yields an optimal solution
(Theorem 1).

LEMMA 3.1. The set of points contributing to C_k is a k-regret
minimizing set S_c(D, L+) if |C_k| ≤ c.

PROOF. C_k is constructed such that, for any linear function
f ∈ L+, a point p on C_k has rank exactly k. Therefore,
krr_D({p}, f) = 0, ∀ f ∈ L+. □

To summarize Lemma 3.1, the contour is necessarily an optimal
solution for any c ≥ |C_k| because it is a representation of the k'th
ranked tuple for any linear utility function, and the k-regret ratio of
the k'th ranked tuple is 0.

Since the k-contour represents the barrier of no k-regret, any
points transformed to lines farther from the origin O than the
contour have positive regret proportional to the distance from O.
Conversely, any points transformed to lines closer to the origin than C_k
with respect to f have k-regret 0, because they are within the top-k
on f.



LEMMA 3.2. For any utility function f ∈ U+ and tuple p_i ∈ D
transformed to line l_i ∈ L, let Δ_{C_k} denote the distance of C_k from
O with respect to f, Δ_i the distance of l_i from O, and Δ'_i ≥ 0
denote the distance of l_i from C_k. Then, krr_D({p_i}, f) = Δ'_i / Δ_i.

PROOF. Recall that each line l_i ∈ L is constructed of vectors x
such that p_i · x = τ. So, since |f| = 1, the distance of l_i from O
in the direction of f is τ/f(p_i).

Thus, if p_{C_k} ∈ D represents a point on C_k in the direction of f,

Δ'_i = Δ_i - Δ_{C_k}
     = τ/f(p_i) - τ/f(p_{C_k})
     = τ (f(p_{C_k}) - f(p_i)) / (f(p_i) f(p_{C_k}))
     = Δ_i · krr_D({p_i}, f).

□
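
As a numerical sanity check of Lemma 3.2, the sketch below compares the dual-space ratio Δ'_i / Δ_i with the primal 2-regret ratio for the singleton {Bryant} on the unit-length version of f = (.5, .5). It assumes τ = 1 and uses the fact, stated earlier in the text, that Durant is the second ranked tuple for this f, so Durant's line carries C_2 in that direction.

```python
from math import hypot

TAU = 1.0
DURANT = (2472 / 2472, 623 / 950)   # second ranked tuple for f (from the text)
BRYANT = (1970 / 2472, 391 / 950)   # the singleton set {p_i} under evaluation


def utility(p, f):
    return p[0] * f[0] + p[1] * f[1]


def dual_distance(p, f, tau=TAU):
    """Distance from O to the dual line p.x = tau along the unit vector f."""
    return tau / utility(p, f)


norm = hypot(0.5, 0.5)
F = (0.5 / norm, 0.5 / norm)

delta_i = dual_distance(BRYANT, F)        # distance of l_i from O
delta_ck = dual_distance(DURANT, F)       # distance of C_2 from O
delta_prime = delta_i - delta_ck          # distance of l_i from C_2

# Primal-space 2-regret ratio of {Bryant} on F (Definition 2.4).
krr = (utility(DURANT, F) - utility(BRYANT, F)) / utility(DURANT, F)
dual_ratio = delta_prime / delta_i
```

Both `krr` and `dual_ratio` evaluate to roughly 0.270, as the lemma predicts.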

COROLLARY 3.3. For a fixed f ∈ U+, let Δ'_i denote the
nonnegative distance for some line l_i ∈ L from C_k in the direction of
f. Then, Δ'_i ∝ krr_D({p_i}, f).

Lemma 3.2 establishes that the regret for a singleton set {p_i}
on a function f in dual space is related to the Euclidean distance of
the transformed line l_i to O and to C_k. Corollary 3.3 notes that,
if we consider only a single utility function, then the regret for
different singleton sets can be straightforwardly compared by the
distance from C_k, since the distance of C_k to O is static. Next, we
show that, given a non-singleton set, the maximum k-regret can
be evaluated efficiently, because it is observed on the lower
envelope in the dual space.

LEMMA 3.4. For a set S ⊆ D, let L denote the set of lines
produced by transforming every point p_i ∈ S into its translated
nullspace, l_i. The lower envelope of L captures the maximum gain
(and, ergo, minimum regret) of S for any f ∈ U+.

PROOF. For any f ∈ U+, the nearest line to O is that line l_i
which is on the lower envelope in the direction of f. Since l_i has
the smallest distance to O of all lines in L, it also has the smallest
distance (possibly negative) to C_k of all lines in L. By
Corollary 3.3, p_i then also has the minimum regret with respect to f of
all p ∈ S. □

Furthermore, while the lower envelope captures everything of
interest about a set S, there are only particular points on the envelope
where the maximum regret can occur: the points where the distance
ratio could be maximized. Lemma 3.5 confirms that these points
are exactly the vertices of the lower envelope and the vertices of
the contour.

LEMMA 3.5. Consider a contour C_k and a convex chain of line
segments C. Let Δ_i denote the distance of C from O in the direction
of f and, similarly, Δ'_i the distance of C to C_k. The expression
Δ'_i / Δ_i is maximized either at a vertex of C_k or a vertex of C.

PROOF. Both C_k and C are piecewise linear, so the expression
Δ'_i / Δ_i can only be maximized at some junction point. □

Since Lemmata 3.4 and 3.5 permit evaluating regret in the dual
space of translated nullspaces, we can derive an alternative view
of the problem of finding an optimal k-regret minimizing set, as
shown in Theorem 1 and illustrated in Figure 4.




Figure 4: An illustration of 2-regret. Shown left are two dif- 
ferent order-1 sets, {Durant} and {Randolph}, along with the 
second contour. The axes are the directions of maximum regret 
for the respective sets. To the right is shown the (non-optimal) 
order-2 set {Randolph, Wade}. 

THEOREM 1. Let g denote the function which transforms any
point p ∈ D to its translated nullspace and let C_k denote the top-k
depth contour of D. The optimal k-regret minimizing set S of D
with size at most c on the family of positive unit linear functions U+
is exactly the set of lines L = {g(p), p ∈ S}, |L| ≤ c, the lower
envelope E of which minimizes the maximum ratio of the distances
from E to C_k and from E to O at any vertex of E and of C_k.

PROOF. From Lemma 3.2, the regret ratio for each point in S is
equivalent to the ratio of the distance from its line to C_k and the
distance from its line to O, and from Lemma 3.4, the best such ratio
is on the lower envelope of L. From Lemma 3.5, the maximum must
occur at either a vertex of C_k or a vertex of E. □

A final remark is with regard to the two dimensional case, in par- 
ticular. We note that any lower envelope is, in fact, a convex chain, 
so the two dimensional problem can be viewed rather as searching 
for the best convex chain. 

LEMMA 3.6. Let g denote the function which transforms any
point p ∈ D to its translated nullspace and let C_k denote the top-k
depth contour of D. The optimal k-regret minimizing set S ⊆ D ⊂
R^2 with size at most c is exactly the convex chain C through the
arrangement of lines L = {g(p), p ∈ S} that has at most c - 1
turns and that minimizes the maximum ratio of the distance from C
to C_k and the distance of C to O.

PROOF. Note that any lower envelope of a set of lines is, in fact,
a convex chain and that, by convexity, any line can appear at most
once on the lower envelope. So, a convex chain with c lines will
have c - 1 turns. Also, any convex chain with c - 1 turns that is
optimal must be a lower envelope of some set of lines, otherwise
the sequence of turns that follows the lower envelope of the same
set of lines will be better. By Theorem 1, this is the optimal
k-regret minimizing set of size c for D. □

4. AN ALGORITHM FOR TWO 
DIMENSIONS 

As we showed in Lemma 3.6, solving the regret minimizing
problem reduces to finding the best convex chain C with fewer than
c turns through an arrangement. The optimal solution is the one
which minimizes the distance ratio of C to C_k and C to O. There
are potentially (n choose c) different convex chains with at most
c - 1 turns, however, so we need to improve upon the O(n^c) naive
algorithm which tries every combination.

We offer here a plane sweep, dynamic programming algorithm
which runs in O(cn^2) time, independently of k. The algorithm
follows each of the n translated nullspace lines in L radially from the
y-axis to the x-axis along a sweep line r, processing intersection
points, and remembering c of the best paths yet encountered for
each line. We maintain three data structures, each of which maintains
some invariant with respect to the current position of r. The
optimal solution can then be read from one of the data structures at the
conclusion of the radial plane sweep.

(a) The arrangement of lines after initial-

Figure 5: The initialisation of the algorithm for two dimensions. The sweep ray begins at the y-axis. The priority queue Q is
populated with intersection points of lines that neighbour on the y-axis and intersect in the positive quadrant, sorted by the order in
which r will pass through them. Every entry in the path list V is originally set to the distance of the corresponding line to the contour
along the y-axis.

As one scans the plane, the data structures need to be updated
wherever two lines intersect in order to reflect the changed state
of the arrangement of lines and to maintain the invariants. All three
of our data structures need to be updated at and only at line
intersections.

4.1 Data Structures 

As mentioned, there are three data structures. The first is the set
of lines, L, sorted by their distance from the origin in the direction
of r. The second is a priority queue, Q, containing intersection
(or event) points yet to be processed, sorted by the order in which
r will pass through them. The third data structure, V, is for the
dynamic programming component and maintains for each line the
best solutions seen between the y-axis and the current position of
r. The structure V is an n by c matrix. In each cell (i, j) is stored
the optimum convex chain from r back to the y-axis which both
contains at most j lines and ends on line l_i.
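
The initial state of the three structures, as described here and in the caption of Figure 5, might be set up as in the following sketch. It is an illustration under stated assumptions rather than the authors' implementation: τ = 1, normalized (points, rebs) data, sweep angles measured from the x-axis so that the y-axis is at π/2, and hypothetical names `crossing_angle` and `initialise`.

```python
import heapq
from math import atan2

STATS = {
    "Durant": (2472, 623), "James": (2258, 554), "Wade": (2045, 373),
    "Nowitzki": (2027, 520), "Bryant": (1970, 391), "Anthony": (1943, 454),
    "Stoudemire": (1896, 732), "Randolph": (1681, 950),
}
NAMES = list(STATS)
POINTS = [(p / 2472, r / 950) for p, r in STATS.values()]


def crossing_angle(p, q):
    """Angle of the ray through the intersection of the dual lines of p and q,
    or None if they do not cross inside the open positive quadrant."""
    dx, dy = p[0] - q[0], p[1] - q[1]
    if dx * dy >= 0:
        return None
    return atan2(abs(dx), abs(dy))   # direction (|dy|, |dx|) satisfies f(p) = f(q)


def initialise(points, k, c, tau=1.0):
    # Distance of every dual line from O along the y-axis, direction (0, 1).
    dist = [tau / p[1] for p in points]
    order = sorted(range(len(points)), key=dist.__getitem__)
    # Events for lines adjacent on the y-axis; negated angles make the heap
    # return the event nearest the y-axis first.
    events = []
    for a, b in zip(order, order[1:]):
        theta = crossing_angle(points[a], points[b])
        if theta is not None:
            heapq.heappush(events, (-theta, a, b))
    # Every cell starts at the line's distance beyond the k'th contour.
    contour = dist[order[k - 1]]
    paths = [[max(d - contour, 0.0)] * c for d in dist]
    return order, events, paths


order, events, paths = initialise(POINTS, k=2, c=2)
```

With k = 2, Randolph's line is nearest the origin on the y-axis, four adjacent pairs cross inside the positive quadrant, and the first event to be processed is the crossing of the Stoudemire and Durant lines.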

4.1.1 Data structure transitions 

We first describe the algorithm by how the data structures evolve
at each intersection point. Both Q and L behave as in traditional
plane sweep algorithms, whereas V behaves as in a traditional
dynamic programming algorithm.



L

Consider an intersection point p_{i,j}, the intersection of lines l_i and
l_j. Because the lines are intersecting, we know they are immediately
adjacent in L. We swap l_i and l_j in L to reflect the fact that
immediately after p_{i,j}, they will have opposite order as compared
to beforehand.

Q 

The priority queue contains all those intersection points that are
between r and the positive x-axis that feature two lines which have
been adjacent at some point between the positive y-axis and r.
Again, consider an intersection point p_{i,j}. Immediately thereafter,
lines l_i and l_j have been swapped. As such, there are potentially
two new intersection points to add to Q, namely that of l_i and its new
neighbour (should one exist) and that of l_j and its new neighbour
(again, should one exist). Both these intersection points are added to
the appropriate place in Q, provided that they are between r and the
positive x-axis. The point p_{i,j} is removed.

V 

Again, consider an intersection point p_{i,j} featuring lines l_i and l_j.
Let l_i be farther from the origin than l_j in the direction of a ray
after r. There are three valid paths through p_{i,j}, as illustrated in
Figure 6.

First consider the line, l_i, that emerges above after p_{i,j} (line l_2 in
Figure 6). Consider also the row of c cells of V that describe best
paths for l_i. The h'th such cell describes the chain with optimum
distance to C_k that uses at most h - 1 turns and emerges from
p_{i,j} along line l_i. Because the turn (l_j, l_i) is invalid, paths for l_i
cannot change, only their costs. The cost is updated to the larger of
what the value was before and the distance from p_{i,j} to C_k in the
direction of r.

So, considering first a chain that leaves along l_i, the best possible
route to the next intersection point of l_i is exactly whatever was
the best possible route along l_i to p_{i,j}. The cost of that route is
the larger of the distance from p_{i,j} to C_k and the cost of the best
possible route to the previous intersection point of l_i. This value is
updated for each of the c cells in row i.

Figure 6: An illustration of the three possible paths through
an intersection point. Either line l_1 or l_2 could simply pass
through. Because an envelope of lines must form a convex
chain, on the other hand, only l_2 has the luxury of turning onto
l_1. The path (l_1, l_2) requires an "illegal" concave turn.

(a) The arrangement of lines immediately before processing the
darker point, (Stoudemire, Randolph).

(b) The priority queue, including all intersection points between r
and the x-axis which have been discovered between the y-axis and r.
The grayed entry is the one being processed. The bolded entry is the
one that is newly added by processing this point.

(c) The best cost values for each line as of the processing of the
darker point in (a). The bold indicates the sustained value in the ta-

Figure 7: The processing of the tenth event point, the intersection of the lines corresponding to Stoudemire and Randolph. The
distance along r of the intersection point from the contour is 0.034. In this case, the newly discovered length-2 chain, (Stoudemire,
Randolph), has cost max(0.105, 0.034), which does not improve on the value already found for Randolph, 0.083.

On the other hand, for a chain that emerges along l_j, there are
two possibilities, depending on the best route to get to p_{i,j}.
Specifically, the best route of h turns to p_{i,j} is the cheaper of the
best route to the previous intersection point along l_j that used h turns
and the best route to the previous intersection point along l_i that
used h - 1 turns. The final cost is then the larger of the distance
from p_{i,j} to C_k and the minimum cost to p_{i,j} as just described.
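
The update of the two affected rows of V at a crossing can be sketched as
below. This is an illustrative reading of the rule just described, not the
authors' code: rows are assumed to be plain Python lists indexed from 0 by
the number of turns allowed, dist stands for the distance from p_{i,j} to
C_k along r, and the function name update_at_crossing is hypothetical.
Path bookkeeping is reduced to a flag recording whether the turn was taken.

```python
def update_at_crossing(cost_i, cost_j, dist):
    """Update the rows of the two lines meeting at an intersection point.

    cost_i[h]: best cost so far of a chain ending on l_i with at most h turns.
    cost_j[h]: the same for l_j. Only the turn from l_i onto l_j is allowed.
    Returns the new rows and, per cell of l_j, whether the turn was taken.
    """
    # l_i can only be reached by passing straight through, so only costs change.
    new_i = [max(v, dist) for v in cost_i]

    new_j, turned = [], []
    for h in range(len(cost_j)):
        stay = cost_j[h]                                   # keep following l_j
        turn = cost_i[h - 1] if h > 0 else float("inf")    # arrive via l_i, one turn spent
        new_j.append(max(min(stay, turn), dist))
        turned.append(turn < stay)
    return new_i, new_j, turned
```

With the one-turn values quoted in the caption of Figure 7 (0.105 for the
chain arriving along Stoudemire, 0.083 already recorded for Randolph, and a
distance of 0.034), the turn is not taken and the cell for Randolph keeps
0.083. The remaining entries of the rows are not given in the text, so any
values used for them in an experiment are placeholders.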

4.2 Algorithm Description 

As we have hinted, the algorithm is a plane sweep through the
arrangement of lines, searching for the minimal cost convex chain
with fewer than c turns. The sweep features a ray r, originally
positioned on the positive y-axis, moving radially through the positive
quadrant to the positive x-axis. This plane sweep approach is
appropriate as a consequence of Lemma 3.5, which reveals that the
cost of any convex chain is maximized at an event point.

To initialize the algorithm, the intersection point of every line
with the positive y-axis is processed, populating the data structures
as in Figure 5. We only add to the priority queue the intersection
points of lines that are immediate neighbours with respect to a sort
on y-intercept, and for which the intersection point occurs in the
positive quadrant. These points are maintained in the queue in
descending order of angle from the positive x-axis (i.e., in the order
in which the ray r will "sweep" them). The array V is initialised
with empty paths for every cell, with a cost set to the distance of
the relevant line to the contour.

The algorithm proceeds simply by popping the next event from
Q, updating the data structures as per Section 4.1.1, and pushing
the new event points onto the queue. For the running basketball
example, this is illustrated in Figure 7. For event points that
correspond to vertices of the contour (since these, too, are
intersection points of lines that will eventually be discovered by the
plane sweep), we update every cell of V with new maximum costs for
each line that has become more distant from C_k.

Eventually, Q will be exhausted as r reaches the positive x-axis
(see Figure 8). Every cell is updated with new maximum costs at
this last contour vertex (the intersection of the contour with the
x-axis). The final step is to scan through all of V and determine the
smallest cost. This is the optimal solution, which is reported along
with the path used to obtain it.

Algorithm 1 describes the algorithm in greater detail.

4.3 Asymptotic Complexity 

THEOREM 2. Algorithm 1 for the two-dimensional case finds
an optimal k-regret minimizing set of order c in O(cn^2) time with
O(n^2) space.

PROOF. First consider space used. The size of the contour is
bounded by n. Of the three data structures, L is of size exactly
n, V is of size exactly n * c ≤ n^2, and Q is proportional to the
largest number of intersection points that have been discovered but
not processed, clearly less than n * (n - 1). Therefore, the total
space is O(n^2).

Regarding time, for each non-contour event point, of which there
may be up to n^2, up to 2c cells of V are updated. For each contour
event point, of which there are |C_k| ≤ n, each of the nc cells of V
could potentially be updated. The initialisation requires a sort of n
lines and then an initialisation of up to n - 1 insertions into Q and
nc values for V, which can be computed in constant time. At the
conclusion of the plane sweep, all nc cells of V must be scanned.
Therefore, the entire procedure is O(n^2 c).

□ 
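
The terms counted in the proof can be tallied in one line. The following is
only a restatement of the proof's own accounting of cell updates; the
O(n log n) cost attributed to the sort is an assumption (a comparison
sort), since the proof does not state the sorting cost explicitly.

```latex
\[
\begin{aligned}
T(n, c) &= \underbrace{O(n \log n)}_{\text{sort}}
         + \underbrace{O(n + nc)}_{\text{initialise } Q \text{ and } V}
         + \underbrace{n^2 \cdot O(c)}_{\text{non-contour events}}
         + \underbrace{|C_k| \cdot O(nc)}_{\text{contour events}}
         + \underbrace{O(nc)}_{\text{final scan}} \\
        &= O(cn^2), \qquad \text{since } |C_k| \le n \text{ and } c \ge 1 .
\end{aligned}
\]
```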

5. AN ALGORITHM FOR GENERAL DIMENSION

In Section 4 we gave an efficient plane sweep, dynamic
programming algorithm to find optimal k-regret minimizing sets in
two dimensions. Unfortunately, plane sweep algorithms do not
readily generalize to higher dimensions. So, in this section, we
offer a greedy algorithm which exploits Lemma 5.1 in order to
progress towards an optimal solution.

LEMMA 5.1. Consider the envelope produced by some set L.
For another set L' to better minimize regret, it is necessary that
some line of L' \ L either passes through the area between the
contour and the envelope or remains entirely under the contour at
the angle for which the distance ratio for L is maximized.




(a) The arrangement of lines after processing every event point,
emptying Q.
(b) The best cost values for each line at the conclusion of the plane
sweep. The bold entries represent optimal solutions for each j < m.

Figure 8: The data structures at the conclusion of the plane sweep, when
both r reaches the x-axis and Q is empty. Each cell (h, i) of V contains
the optimum solution that contains i or fewer lines and ends with line
l_h. The minimum value in the entire table is the optimal solution for
this two-dimensional k-Regret problem.



PROOF. This results from Corollary 3.3, which indicates that
for that fixed utility function, distance is directly proportional to
the regret ratio. The L2 distance metric is transitive. So, if no line
in L' intersects the area underneath the lower envelope of L at the
angle for which the distance ratio is maximized, then its maximum
distance is clearly larger. □

We use the insight of Lemma 5.1 to design a greedy algorithm
which behaves as follows. Consider a given set of lines S for which
the distance ratio is maximized at point p. Consider also a new line
l ∉ S. We advance to a new set S' exactly when l passes between
the origin and p and there is some element l' ∈ S such that
the maximum distance ratio of (S \ {l'}) ∪ {l} to C_k is less than that
at p. That is to say, if l intersects the segment [O, p], then we look
for a new, better set that can be obtained by replacing some element
of S with l. We know from Lemma 5.1 that the intersection test is
a necessary condition to find some improved solution S'.

Overall, the algorithm begins with an initial seed solution and
cycles through all lines repeatedly, conducting the above test in order
to refine S, until no line can improve the cost any more. We report
this solution. The greedy algorithm is feasible because, given the
contour insight that we have developed in this paper, one can
efficiently determine which of two sets is better: it is the one
for which the maximum distance of the lower envelope of the set
from C_k is minimized. The algorithm is detailed in Algorithm 2.

Note that the algorithm is guaranteed to terminate because it will 
always progress towards a better solution due to Line 13, and there 
are finitely many subsets of L. 
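
The control flow of the greedy refinement can be sketched independently of
the geometry. In this hedged sketch, cost and may_improve are hypothetical
callables supplied by the caller: cost(S) is assumed to return the maximum
distance ratio of the lower envelope of S to C_k, and may_improve(l, S) is
assumed to implement the necessary test of Lemma 5.1 (does l intersect the
segment from the origin to p). Neither is implemented here, and the name
greedy_refine is introduced only for illustration.

```python
from collections import deque

def greedy_refine(lines, seed, cost, may_improve=lambda line, current: True):
    """Repeatedly swap one member of the current set for a queued line
    whenever doing so strictly lowers the cost; stop when no swap helps."""
    current = list(seed)
    best = cost(current)
    queue = deque(l for l in lines if l not in current)
    while queue:
        line = queue.popleft()
        if not may_improve(line, current):
            continue  # the necessary condition fails, so skip the swap search
        for idx in range(len(current)):
            candidate = current[:idx] + current[idx + 1:] + [line]
            value = cost(candidate)
            if value < best:
                # Strict improvement: accept and re-queue every excluded line.
                current, best = candidate, value
                queue = deque(l for l in lines if l not in current)
                break
    return current, best
```

Because a swap is accepted only on strict improvement, the cost sequence is
strictly decreasing and no set is visited twice, which mirrors the
termination argument above. As a toy check with a stand-in cost,
greedy_refine([1, 2, 3, 4, 5], [1, 2], lambda s: -sum(s)) ends at the set
{4, 5}.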



6. RELATED WORK 

The idea of representing an entire dataset with a few representative
tuples for multi-criteria decision making has drawn much research
attention in the past decade, since the introduction of the Skyline
operator by Börzsönyi et al. [2]. However, its susceptibility to
the curse of dimensionality is well-known. Chan et al. @ made a 
compelling case for this, demonstrating that on the NBA basketball 
dataset (as it was at the time), more than 1 in 20 tuples appear in 
the skyline in high dimensions. Consequently, there have been nu- 
merous efforts to derive a smaller cardinality representative subset 
(e.g., [3, 13, 20]), especially one that presents very distinct tuples
(e.g., (9][H)). 

Regret and regret minimizing sets are relatively new in the lineage
of these efforts. When introduced by Nanongkai et al. [14], the
emphasis was on proving that the maximum regret ratio is bounded
by:

\[ \frac{d-1}{(c-d+1)^{\frac{1}{d-1}} + d - 1}. \]

Naturally, this bound holds for our generalisation introduced in 
this paper. As far as we know, no research has yet concerned itself 
with finding optimal regret minimizing sets. 
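
To get a feel for how this bound behaves, it can be evaluated numerically.
The sketch below assumes the bound has the form
(d - 1) / ((c - d + 1)^(1/(d-1)) + d - 1), with d the number of dimensions
and c the cardinality of the representative set; the function name
regret_ratio_bound and the input checks are additions made here for
illustration.

```python
def regret_ratio_bound(c, d):
    """Upper bound on the maximum regret ratio for a set of c tuples in d
    dimensions, under the assumed form (d-1) / ((c-d+1)^(1/(d-1)) + d - 1)."""
    if d < 2:
        raise ValueError("the exponent 1/(d-1) requires d >= 2")
    if c < d:
        raise ValueError("the formula is evaluated here only for c >= d")
    return (d - 1) / ((c - d + 1) ** (1.0 / (d - 1)) + d - 1)
```

For instance, in two dimensions the expression reduces to 1 / c, so ten
tuples give a bound of 0.1, whereas in higher dimensions the bound shrinks
far more slowly as c grows.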

The top-k query on which regret is based is well studied. The
Pareto-dominance graph [21] uses ideas of Pareto-optimality to
index for top-k queries. The Onion Technique [5] is a depth-based
approach, but suffers the same curse of dimensionality as the
skyline. Ilyas offers a nice survey on the topic of top-k queries [12].
Duality transforms are pervasive in this research area (e.g., [8, 17]).
We use the duality transform of Chester et al. [6, 7] because of the
immediate results on top-k depth contours it provides in answering
the reverse top-k queries of Vlachou et al. [18, 19].

Using duality transforms on data points casts the problem into
the context of arrangements. In two dimensions, plane sweep
algorithms [10] are a typical approach to solving problems on
arrangements of lines. Agarwal et al. [1] give bounds on the number
of edges and vertices that can exist at each level (or depth) of an
arrangement. Top-k depth contours are not the only notion of
contours or depth in arrangements of lines: Hugg et al. [11] evaluate
several depth measures and Zuo et al. [22] derive general statistical
results that apply to many of them and could be useful in extending
this work. Rousseeuw and Hubert consider depth in arrangements
for dimensions greater than two [15], which could present deeper
insight into the greedy algorithm presented here.



7. CONCLUSION 

Regret minimizing sets are a nice alternative to skyline as a
succinct representative subset of a dataset, but suffer from a very
strict assumption that users expect to see their top-1 choice for their
queries. We generalised the notion to that of k-regret minimizing
sets, which evaluates how representative a subset of a dataset is
not by how closely it approximates every user's top-1 choice, but
their top-k choice. We showed that in dual space, the top-k depth
contour of a dataset is exactly the optimal k-regret minimizing set.
If the cardinality of the k-regret minimizing set is specified as an
input parameter, then the convex chain that minimizes the ratio of
distances from itself to the contour and the contour to the origin is
precisely the optimal solution. We used these ideas to construct an
O(n^2 c) optimal algorithm for two dimensions and a greedy
algorithm for general dimension.
APPENDIX

A. ALGORITHM PSEUDOCODE 



Algorithm 1 Calculating an optimal k-max-regret minimizing set
S from D ⊂ R^2 with |S| ≤ c

Input: C_k; c; L, sorted by y-intercept
Output: S ⊆ L, the lines that together form an optimal solution S
        with |S| ≤ c
if |C_k| ≤ c then
    Return C_k
end if
Initialize Q as an empty priority queue; priority is angle of points, desc
for all l ∈ L do
    Set V(l) = c * [(y-intercept, max(y-intercept - y-intercept of C_k, 0))]
    Add to Q intersect(l and next l) if not last l
end for
while Q is not empty do
    Let p be next point in Q
    Let λ be distance ratio of p to C_k and C_k to O
    if p ∈ C_k then
        for all (v ∈ V) do
            Let λ' be distance ratio of line to C_k and C_k to O
            Let v be max(v, λ')
        end for
    end if
    Retrieve adjacent l_i, l_{i+1} that intersect at p
    Add intersect(l_{i-1}, l_{i+1}) if angle less than that of p
    Add intersect(l_i, l_{i+2}) if angle less than that of p
    for all j ∈ [0, c) do
        Let V(l_i)_j = max(V(l_i)_j, λ)
        if j > 0 then
            if V(l_i)_{j-1} < V(l_{i+1})_j then
                Add p to path of V(l_{i+1})_j
                Let V(l_{i+1})_j = max(V(l_i)_{j-1}, λ)
            else
                Let V(l_{i+1})_j = max(V(l_{i+1})_j, λ)
            end if
        else
            Let V(l_{i+1})_j = max(V(l_{i+1})_j, λ)
        end if
    end for
    Swap l_i and l_{i+1}
end while
for all l ∈ L, j ∈ [0, c) do
    Remember v = V(l)_j if smallest yet seen, breaking ties with smaller j
end for
RETURN set of lines generating path corresponding to v



Algorithm 2 Greedy algorithm to compute k-max-regret minimizing
set S from D ⊂ R^d with |S| ≤ m

 1: Input: C_k; m; L
 2: Output: S ⊆ L, the lines that together form a solution S with |S| ≤ m
 3: if |C_k| ≤ m then
 4:     Return C_k
 5: end if
 6: Select an arbitrary set S ⊆ L, such that |S| = m
 7: Let p be point at which distance ratio from lower envelope of S to C_k is maximized
 8: Place all lines l_i ∉ S into unsorted queue, Q.
 9: while Q is not empty do
10:     Let l be next line in Q
11:     if l intersects [O, p] then
12:         for all (l' ∈ S) do
13:             if Distance ratio of (S \ {l'}) ∪ {l} to C_k < distance ratio of S to C_k then
14:                 Let S be (S \ {l'}) ∪ {l}
15:                 Restore to Q all l_i ∉ S
16:                 Break
17:             end if
18:         end for
19:     end if
20: end while
21: RETURN S



